

<!DOCTYPE html>
<html lang="zh-CN">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Data Analysis and Mining (Part 1) - cxp&#39;s blog</title>
  <meta name="apple-mobile-web-app-capable" content="yes" />
  <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
  <meta name="google" content="notranslate" />

  
  
  <meta name="description" content=" Q1. Data warehouse (Data Warehouse, DW..."> 
  
  <meta name="author" content="Alex"> 

  
    <link rel="icon" href="/images/icons/favicon-16x16.png" type="image/png" sizes="16x16">
  
  
    <link rel="icon" href="/images/icons/favicon-32x32.png" type="image/png" sizes="32x32">
  
  
    <link rel="apple-touch-icon" href="/images/icons/apple-touch-icon.png" sizes="180x180">
  
  
    <meta rel="mask-icon" href="/images/icons/stun-logo.svg" color="#333333">
  
  
    <meta rel="msapplication-TileImage" content="/images/icons/favicon-144x144.png">
    <meta rel="msapplication-TileColor" content="#000000">
  

  <link rel="stylesheet" href="/css/style.css">

  
  <link rel="stylesheet" href="//at.alicdn.com/t/font_1445822_h1619vhl1nr.css">
  

  
  
  <link rel="stylesheet" href="https://cdn.bootcss.com/fancybox/3.5.7/jquery.fancybox.min.css">
  

  
  
  <link rel="stylesheet" href="https://cdn.bootcss.com/highlight.js/9.18.1/styles/xcode.min.css">
  

  <script>
    var CONFIG = window.CONFIG || {};
    var ZHAOO = window.ZHAOO || {};
    CONFIG = {
      isHome: false,
      fancybox: true,
      pjax: false,
      lazyload: {
        enable: true,
        loadingImage: '',
      },
      donate: {
        enable: true,
        alipay: 'https://pic.izhaoo.com/alipay.jpg',
        wechat: 'https://pic.izhaoo.com/wechat.jpg'
      },
      motto: {
        api: '',
        default: '我在开了灯的床头下，想问问自己的心啊。'
      },
      galleries: {
        enable: true
      },
      fab: {
        enable: true,
        alwaysShow: false
      },
      carrier: {
        enable: true
      },
      daovoice: {
        enable: true
      }
    }
  </script>

  

  
<link rel="alternate" href="/atom.xml" title="cxp's blog" type="application/atom+xml">
</head>
<body class="lock-screen">
  <div class="loading"></div>
  


<nav class="navbar">
  <div class="left"></div>
  <div class="center">Data Analysis and Mining (Part 1)</div>
  <div class="right">
    <i class="iconfont iconmenu j-navbar-menu"></i>
  </div>
</nav>

  <nav class="menu">
  <div class="menu-wrap">
    <div class="menu-close">
      <i class="iconfont iconbaseline-close-px"></i>
    </div>
    <ul class="menu-content">
      
      
      
      
      <li class="menu-item"><a href="/ " class="underline"> Home</a></li>
      <li class="menu-item"><a href="/galleries " class="underline"> Photography</a></li>
      <li class="menu-item"><a href="/archives " class="underline"> Archives</a></li>
      <li class="menu-item"><a href="/tags " class="underline"> Tags</a></li>
      <li class="menu-item"><a href="/categories " class="underline"> Categories</a></li>
      <li class="menu-item"><a href="/about " class="underline"> About</a></li>
      
    </ul>
    <div class="menu-copyright"><p>Powered by <a target="_blank" href="https://hexo.io">Hexo</a>  |  Theme - <a target="_blank" href="https://github.com/izhaoo/hexo-theme-zhaoo">zhaoo</a></p></div>
  </div>
</nav>
  <main id="main">
  <div class="container" id="container">
    <article class="article">
  <div class="wrap">
    <section class="head">
  <img   class="lazyload" data-original="/images/theme/post-image.jpg" src=""  draggable="false">
  <div class="head-mask">
    <h1 class="head-title">Data Analysis and Mining (Part 1)</h1>
    <div class="head-info">
      <span class="post-info-item"><i class="iconfont iconcalendar"></i>March 24, 2019</span>
      
      <span class="post-info-item"><i class="iconfont iconfont-size"></i>1625</span>
    </div>
  </div>
</section>
    <section class="main">
      <section class="content">
        <h1 id="q1-数据仓库data-warehouse-dw和数据库的区别"><a class="markdownIt-Anchor" href="#q1-数据仓库data-warehouse-dw和数据库的区别"></a> Q1. What is the difference between a data warehouse (DW) and a database?</h1>
<blockquote>
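<p>A data warehouse is a large collection of stored data, built for enterprise analytical reporting and decision support. It filters and integrates many kinds of business data to give the enterprise business intelligence (BI) capabilities, which guide business process improvement and the monitoring and control of time, cost, and quality.</p>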
</blockquote>
<ul>
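<li>The inputs of a data warehouse are data sources of every kind; its outputs feed enterprise data analysis, data mining, reporting, and similar uses.</li>
<li>A data warehouse can consolidate data on the same subject that is stored across different databases.</li>
<li>Data from different sources is integrated through ETL (Extract-Transform-Load), the process of moving data from its sources to a target: extract the data, transform it, then load it.</li>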
</ul>
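<p>The three ETL steps can be sketched with the Python standard library. This is only an illustration under stated assumptions: the two CSV sources, their column names, the yuan/fen units, and the <code>orders</code> table are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.</p>

```python
import csv
import io
import sqlite3

# Two hypothetical sources that describe orders with different schemas.
SOURCE_A = "order_id,amount_yuan\n1,10.5\n2,20.0\n"
SOURCE_B = "id,amount_fen\n3,3000\n"


def extract(text):
    """Extract: read raw rows from one CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows_a, rows_b):
    """Transform: map both sources onto one schema (order_id, amount in yuan)."""
    unified = [(int(r["order_id"]), float(r["amount_yuan"])) for r in rows_a]
    unified += [(int(r["id"]), int(r["amount_fen"]) / 100) for r in rows_b]
    return unified


def load(rows):
    """Load: write the unified rows into the target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


warehouse = load(transform(extract(SOURCE_A), extract(SOURCE_B)))
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print("total order amount:", total)
```

<p>Once the rows share one schema, a single query can answer questions that span both sources, which is the point of consolidating by subject.</p>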
<p>Data warehouse products: Hive, Teradata, Oracle, Db2, and others.</p>
<h1 id="q2-数据分析怎么获取数据"><a class="markdownIt-Anchor" href="#q2-数据分析怎么获取数据"></a> Q2. Data analysis: how do you get the data?</h1>
<ul>
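<li>Pull it from a data warehouse.</li>
<li>Data monitoring, for example readings from sensors.</li>
<li>Crawl it with scraping packages such as urllib, urllib2 (Python 2 only), requests, or scrapy. Crawling also involves topics such as distributed crawlers and anti-crawler countermeasures.</li>
<li>Forms, event tracking, and logs. Event tracking records how users interact with a website or app; logs carry more compact information and make it easier to locate problems. Logs are usually stored as files and include front-end and back-end logs; front-end logs have to be uploaded to the back end before they can be inspected.</li>
<li>Compute new values from data you already have, so that the result carries practical meaning.</li>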
</ul>
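<p>As a small illustration of turning tracking logs into data for analysis, the sketch below parses a few log lines and counts clicks per page. The line format (<code>key=value</code> fields after a timestamp) is an assumption made for this example; real systems each define their own.</p>

```python
from collections import Counter

# Hypothetical tracking-log lines; real formats vary by system.
LOG_LINES = [
    "2019-03-24 10:00:01 user=42 event=click page=/home",
    "2019-03-24 10:00:05 user=42 event=view page=/tags",
    "2019-03-24 10:00:09 user=7 event=click page=/home",
]


def parse_line(line):
    """Split one line into a timestamp plus its key=value fields."""
    date, time, *pairs = line.split()
    record = {"timestamp": date + " " + time}
    for pair in pairs:
        key, value = pair.split("=", 1)
        record[key] = value
    return record


records = [parse_line(line) for line in LOG_LINES]
clicks_per_page = Counter(r["page"] for r in records if r["event"] == "click")
print(clicks_per_page)
```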
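<h1 id="q3-学习资源"><a class="markdownIt-Anchor" href="#q3-学习资源"></a> Q3. Learning resources</h1>
<ul>
<li>Data science learning and competition sites: Kaggle, Tianchi</li>
<li>Dataset sites: ImageNet, Open Images</li>
<li>Statistical data (statistics bureaus, company financial reports, government agencies, etc.)</li>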
</ul>
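<h1 id="q4-怎么进行数据探索"><a class="markdownIt-Anchor" href="#q4-怎么进行数据探索"></a> Q4. How do you explore the data?</h1>
<p><strong>When you get a data set, first read its documentation and get familiar with the field names used in the files.</strong></p>
<p>For the data in the set, we can start by studying how it is distributed: <code>central tendency and dispersion</code></p>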
<ol>
<li>
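<p>Central tendency:</p>
<ul>
<li>Mean</li>
<li>Median</li>
<li>Quantiles: the quartiles sit at positions <code>(n+1)*[0.25,0.5,0.75]</code> in the sorted data. How you compute them depends on whether the number of data points is odd or even, because a position can fall between two values.</li>
<li>Mode</li>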
</ul>
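<p>A minimal sketch of the <code>(n+1)*p</code> position rule, using only the standard library. The helper <code>quantile_by_position</code> and the sample scores are made up for illustration; the helper assumes the input is already sorted and that every position falls between 1 and n, which holds for quartiles of the lists used here. When a position is fractional, it interpolates linearly between the two neighboring values.</p>

```python
import statistics


def quantile_by_position(sorted_values, p):
    """Value at 1-based position (n + 1) * p, interpolating between neighbors."""
    position = (len(sorted_values) + 1) * p
    lower = int(position)          # 1-based index of the left neighbor
    fraction = position - lower    # how far toward the right neighbor
    left = sorted_values[lower - 1]
    if fraction == 0:
        return left
    right = sorted_values[lower]
    return left + fraction * (right - left)


odd_scores = [2, 3, 3, 5, 7, 8, 9]        # n = 7: positions 2, 4, 6 are whole
even_scores = [2, 3, 3, 5, 7, 8, 9, 10]   # n = 8: positions 2.25, 4.5, 6.75

for scores in (odd_scores, even_scores):
    quartiles = [quantile_by_position(scores, p) for p in (0.25, 0.5, 0.75)]
    print(statistics.mean(scores), statistics.median(scores),
          statistics.mode(scores), quartiles)
```

<p>For these inputs the result matches <code>statistics.quantiles(scores, n=4)</code>, whose default "exclusive" method is based on the same positions. Other tools use different interpolation rules by default, so small differences between libraries are expected.</p>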
</li>
<li>
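<p>Dispersion:</p>
<ul>
<li>Standard deviation: std()</li>
<li>Variance: var()</li>
<li>For normally distributed data, about 68% of values lie within 1 std of the mean, 95% within 1.96 std, and 99% within 2.58 std.</li>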
</ul>
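<p>The coverage figures can be checked against the standard normal distribution. The sketch below uses only the standard library; the sample values are made up. It also shows the difference between population and sample formulas, which matters because pandas <code>std()</code> and <code>var()</code> use the sample version (dividing by n - 1) by default.</p>

```python
from statistics import NormalDist, pstdev, pvariance, stdev, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population formulas divide by n; sample formulas divide by n - 1.
print("population:", pvariance(values), pstdev(values))
print("sample:    ", variance(values), stdev(values))

# Share of a normal distribution that lies within k standard deviations.
standard_normal = NormalDist(mu=0, sigma=1)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 1.96, 2.58)
}
for k, share in coverage.items():
    print(f"within {k} std: {share:.2%}")
```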
</li>
<li>
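<p>Shape of the distribution:</p>
<ul>
<li>Skewness: positive skew (longer right tail) or negative skew (longer left tail)</li>
<li>Kurtosis (a measure of how concentrated the data is; the larger the value, the sharper the peak): normally distributed data has a kurtosis of 3. As a rule of thumb, if the value differs from 3 by more than 2 (&lt;1 or &gt;5), the data can be treated as not normally distributed. Note that pandas <code>kurt()</code> reports excess kurtosis (kurtosis minus 3), so normal data scores about 0 there.</li>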
</ul>
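<p>Both coefficients can be written directly from central moments. The sketch below uses the plain population formulas, which is an assumption of this example: pandas <code>skew()</code> and <code>kurt()</code> apply bias corrections, so their results differ somewhat on small samples. The helper names and data are made up for illustration.</p>

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by std ** 3; 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by variance ** 2; 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

for data in (symmetric, right_tailed):
    print(data, "skewness:", skewness(data),
          "kurtosis:", kurtosis(data),
          "excess kurtosis:", kurtosis(data) - 3)
```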
</li>
<li>
<p>Common distributions:</p>
<ul>
<li>Normal distribution (including the standard normal)</li>
<li>t distribution</li>
<li>F distribution</li>
<li>Chi-squared distribution</li>
</ul>
</li>
<li>
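<p>Sampling, used for two reasons: computing over the full data set is expensive, and the need can often be met without a full computation.</p>
<ul>
<li>Stratified sampling</li>
<li>Systematic (equal-interval) sampling</li>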
<li>…</li>
</ul>
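<p>A rough sketch of the two named sampling schemes, using only the standard library. The function names, the synthetic population, and the choice to sample each stratum in proportion to its size are assumptions made for illustration.</p>

```python
import random
from collections import defaultdict


def systematic_sample(items, step, rng):
    """Pick a random start, then take every step-th item."""
    start = rng.randrange(step)
    return items[start::step]


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic population: 75 records in city A, 25 in city B.
population = [{"id": i, "city": "A" if i % 4 else "B"} for i in range(100)]

systematic = systematic_sample(population, step=10, rng=rng)
stratified = stratified_sample(population, key=lambda p: p["city"],
                               fraction=0.2, rng=rng)
print(len(systematic), len(stratified))
```

<p>Stratifying keeps the 3:1 ratio between the two cities in the sample, which a small simple random sample would only match approximately.</p>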
</li>
<li>
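<p>Types of data (levels of measurement):</p>
<ul>
<li>Nominal: categories with no order or distance between them.</li>
<li>Ordinal: the values can be ranked, but the gaps between ranks are not measurable.</li>
<li>Interval: differences are meaningful, but there is no absolute zero, so ratios (multiplication and division) are not; temperature is an example.</li>
<li>Ratio: has an absolute zero, so magnitudes and ratios are meaningful; most everyday measurements are of this type.</li>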
</ul>
</li>
<li>
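<p>Single-attribute analysis</p>
<ul>
<li>Outlier analysis (continuous outliers, found with quantiles; discrete outliers; knowledge-based outliers)</li>
<li>Comparative analysis:
<ul>
<li>What to compare: absolute numbers (compare the figures directly) or relative numbers, namely structure (part versus whole), proportion (different parts within the whole), comparison (like with like), dynamic (change over time), and intensity (density, per-capita figures, etc.).</li>
<li>How to compare: over time (year-over-year, period-over-period), across space (cities, departments, companies), and against experience and plan (actual progress versus schedule).</li>
</ul>
</li>
<li>Structural analysis (part versus whole; static or dynamic)</li>
<li>Distribution analysis: look at the empirical probability distribution directly, or check whether the data is normally distributed</li>
<li>Maximum likelihood</li>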
</ul>
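<p>One common way to find outliers in continuous data with quantiles is the interquartile-range (IQR) rule, sketched below with the standard library. The 1.5 multiplier is a widely used convention and an assumption here, not something the notes above prescribe; the readings are made up.</p>

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1                      # interquartile range
    low = q1 - factor * spread
    high = q3 + factor * spread
    return [v for v in values if v > high or low > v]


readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(readings)
print(outliers)
```

<p>Whether a flagged value is an error or a genuine extreme still has to be decided with domain knowledge, which is where the knowledge-based outliers in the list come in.</p>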
</li>
<li>
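<p>Related APIs (pandas and NumPy)</p>
<ul>
<li>max()</li>
<li>min()</li>
<li>median(): median</li>
<li>var(): variance</li>
<li>std(): standard deviation</li>
<li>skew(): skewness</li>
<li>kurt(): kurtosis</li>
<li>quantile(): quantiles</li>
<li>fillna()</li>
<li>dropna(axis=0, how='any')</li>
<li>value_counts(): accepts bins, which gives left-open, right-closed intervals (]</li>
<li>value_counts(normalize=True): reports proportions instead of counts</li>
<li>sort_index()</li>
<li>histogram(value, bins=10): histogram; an integer bins sets how many equal-width bins to use</li>
<li>histogram(value, bins=np.arange(0,1,0.1)): explicit bin edges; bins are left-closed, right-open [), except the last bin, which also includes its right edge</li>
<li>groupby(): grouping; choose the aggregation that fits your needs.</li>
<li>apply()</li>
<li>loc[]: label-based slicing</li>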
</ul>
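<p>The calls above can be tried together on a small frame. This sketch assumes pandas and NumPy are installed; the <code>dept</code> and <code>score</code> columns and their values are made up for illustration.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dept": ["sales", "sales", "tech", "tech", "tech", "hr"],
    "score": [0.2, 0.4, 0.6, np.nan, 0.9, 0.5],
})

clean = df.dropna(axis=0, how="any")                 # drop rows with any missing value
filled = df.fillna({"score": df["score"].mean()})    # or fill the gap instead

score = clean["score"]
summary = {
    "max": score.max(), "min": score.min(), "median": score.median(),
    "var": score.var(), "std": score.std(),
    "skew": score.skew(), "kurt": score.kurt(),
    "quartiles": score.quantile([0.25, 0.5, 0.75]).tolist(),
}

# Share of rows per department, ordered by department name.
share = clean["dept"].value_counts(normalize=True).sort_index()

# Histogram with explicit bin edges covering 0 to 1.
counts, edges = np.histogram(score, bins=np.arange(0.0, 1.1, 0.1))

# Group by department and pick the aggregations you need.
by_dept = clean.groupby("dept")["score"].agg(["mean", "count"])

# Label-based selection: scores of the rows where dept is "tech".
tech_scores = clean.loc[clean["dept"] == "tech", "score"]

print(summary)
print(share)
print(by_dept)
```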
</li>
</ol>

      </section>
      <section class="extra">
        
        <ul class="copyright">
  
  <li><strong>Author: </strong>Alex</li>
  <li><strong>Permalink: </strong><a href="https://cxpeng.cn/archives/9c2ffa7.html">https://cxpeng.cn/archives/9c2ffa7.html</a></li>
  <li><strong>License: </strong>All articles on this blog are licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/deed.zh"
      rel="external nofollow" target="_blank"> BY-NC-SA </a>; please credit the source when reposting.</li>
  
</ul>
        
        
        <section class="donate">
  <div class="qrcode">
    <img   class="lazyload" data-original="https://pic.izhaoo.com/alipay.jpg" src="" >
  </div>
  <div class="icon">
    <a href="javascript:;" target="_blank" rel="noopener" id="alipay"><i class="iconfont iconalipay"></i></a>
    <a href="javascript:;" target="_blank" rel="noopener" id="wechat"><i class="iconfont iconwechat-fill"></i></a>
  </div>
</section>
        
        
  <ul class="tag-list" itemprop="keywords"><li class="tag-list-item"><a class="tag-list-link" href="/tags/Python/" rel="tag">Python</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/" rel="tag">Data Analysis</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98/" rel="tag">Data Mining</a></li></ul>

        
<nav class="nav">
  
    <a href="/archives/843f592a.html"><i class="iconfont iconleft"></i>Linux Basics (Part 1)</a>
  
  
    <a href="/archives/45d22ee3.html">Scraping internship postings from WeChat Moments with Appium + Python<i class="iconfont iconright"></i></a>
  
</nav>

      </section>
      
      <section class="comments">
  
  <div class="btn" id="comments-btn">View comments</div>
  
  
</section>
      
    </section>
  </div>
</article>
  </div>
</main>
  <footer class="footer">
  <div class="footer-social">
    
    
    
    
    
    <a href="tencent://message/?Menu=yes&uin=894519210 " target="_blank" onMouseOver="this.style.color= '#12B7F5'"
      onMouseOut="this.style.color='#33333D'">
      <i class="iconfont footer-social-item  iconQQ "></i>
    </a>
    
    
    
    
    
    <a href="javascript:; " target="_blank" onMouseOver="this.style.color= '#09BB07'"
      onMouseOut="this.style.color='#33333D'">
      <i class="iconfont footer-social-item  iconwechat-fill "></i>
    </a>
    
    
    
    
    
    <a href="https://www.instagram.com/izhaoo/ " target="_blank" onMouseOver="this.style.color= '#DA2E76'"
      onMouseOut="this.style.color='#33333D'">
      <i class="iconfont footer-social-item  iconinstagram "></i>
    </a>
    
    
    
    
    
    <a href="https://github.com/izhaoo " target="_blank" onMouseOver="this.style.color= '#24292E'"
      onMouseOut="this.style.color='#33333D'">
      <i class="iconfont footer-social-item  icongithub-fill "></i>
    </a>
    
    
    
    
    
    <a href="mailto:izhaoo@163.com " target="_blank" onMouseOver="this.style.color='#FFBE5B'"
      onMouseOut="this.style.color='#33333D'">
      <i class="iconfont footer-social-item  iconmail"></i>
    </a>
    
  </div>
  <div class="footer-copyright"><p>Powered by <a target="_blank" href="https://hexo.io">Hexo</a>  |  Theme - <a target="_blank" href="https://github.com/izhaoo/hexo-theme-zhaoo">zhaoo</a></p></div>
</footer>
  
      <div class="fab fab-plus">
    <i class="iconfont iconplus"></i>
  </div>
  
  <div class="fab fab-daovoice">
    <i class="iconfont iconcomment"></i>
  </div>
  
  <div class="fab fab-up">
    <i class="iconfont iconcaret-up"></i>
  </div>
  
<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        tex2jax: {
            inlineMath: [ ["$","$"], ["\\(","\\)"] ],
            skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code'],
            processEscapes: true
        }
    });
    MathJax.Hub.Queue(function() {
        var all = MathJax.Hub.getAllJax();
        for (var i = 0; i < all.length; ++i)
            all[i].SourceElement().parentNode.className += ' has-jax';
    });
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</body>

<script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>




<script src="https://cdn.bootcdn.net/ajax/libs/jquery.lazyload/1.9.1/jquery.lazyload.min.js"></script>




<script src="https://cdn.bootcss.com/fancybox/3.5.7/jquery.fancybox.min.js"></script>




<script src="/js/utils.js"></script>
<script src="/js/modules.js"></script>
<script src="/js/zui.js"></script>
<script src="/js/script.js"></script>




<script>
  (function (i, s, o, g, r, a, m) {
    i["DaoVoiceObject"] = r;
    i[r] = i[r] || function () {
      (i[r].q = i[r].q || []).push(arguments)
    }, i[r].l = 1 * new Date();
    a = s.createElement(o), m = s.getElementsByTagName(o)[0];
    a.async = 1;
    a.src = g;
    a.charset = "utf-8";
    m.parentNode.insertBefore(a, m)
  })(window, document, "script", ('https:' == document.location.protocol ? 'https:' : 'http:') +
    "//widget.daovoice.io/widget/0f81ff2f.js", "daovoice")
  daovoice('init', {
    app_id: "abcdefg"
  }, {
    launcher: {
      disableLauncherIcon: true,
    },
  });
  daovoice('update');
</script>



<script>
  (function () {
    var bp = document.createElement('script');
    var curProtocol = window.location.protocol.split(':')[0];
    if (curProtocol === 'https') {
      bp.src = 'https://zz.bdstatic.com/linksubmit/push.js';
    } else {
      bp.src = 'http://push.zhanzhang.baidu.com/push.js';
    }
    var s = document.getElementsByTagName("script")[0];
    s.parentNode.insertBefore(bp, s);
  })();
</script>


<script>
  var _hmt = _hmt || [];
  (function () {
    var hm = document.createElement("script");
    hm.src = "https://hm.baidu.com/hm.js?4c204d8bc027a0455b5fc642ac334ca8";
    var s = document.getElementsByTagName("script")[0];
    s.parentNode.insertBefore(hm, s);
  })();
</script>










</html>